Welcome to the getting started tutorial for EpiData's Jupyter Notebook interface. In this tutorial we will query, retrieve and analyze sample weather data acquired from a simulated wireless sensor network.
Note: The tutorial assumes that you have a working knowledge of Jupyter Notebook.
As a first step, we will import the packages and modules required for this tutorial. Because EpiData Context (ec) is required to use the application, it is imported implicitly, which is why its import line is commented out in the cell below. Other modules, such as datetime, pandas and matplotlib, can be imported at this time. Let's run the cell below to import these modules.
In [ ]:
#from epidata.context import ec
from datetime import datetime, timedelta
import pandas as pd
import matplotlib.pyplot as plt
Data stored in the database can be queried by specifying the values of the primary keys, a start time and a stop time. The required primary keys for the current dataset are company, site, station and sensor.
In many cases, one may not know the values stored in the primary keys. The ec.list_keys() function returns the valid combinations of primary key values. Let's run the code below to see these values for the sample weather data.
In [ ]:
keys = ec.list_keys()
keys.toPandas()
Now that we know the valid primary key values for the sample weather data, we can pass them to the ec.query_measurements_original() function, together with the start and stop times. The function returns the query result as an EpiData DataFrame.
In [ ]:
primary_key={"company": "EpiData", "site": "San_Jose", "station":"WSN-1", "sensor": ["Temperature_Probe","Anemometer","RH_Probe"]}
start_time = datetime.strptime('8/1/2017 00:00:00', '%m/%d/%Y %H:%M:%S')
stop_time = datetime.strptime('8/31/2017 00:00:00', '%m/%d/%Y %H:%M:%S')
df = ec.query_measurements_original(primary_key, start_time, stop_time)
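The query window can also be derived with timedelta, which the import cell brings in. The following is a minimal standalone sketch using only the standard library; the 30-day offset is chosen to reproduce the August 2017 window above, and the variable name window is illustrative rather than part of the EpiData API.

```python
from datetime import datetime, timedelta

# Parse the start of the window with the same format string as the tutorial.
start_time = datetime.strptime('8/1/2017 00:00:00', '%m/%d/%Y %H:%M:%S')

# Derive the stop time as an offset from the start instead of parsing a second string.
window = timedelta(days=30)
stop_time = start_time + window

print(stop_time.isoformat())
```

Expressing the stop time as an offset keeps the two endpoints consistent when the start of the window changes.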
Data is retrieved from the database as an EpiData DataFrame. To save memory and compute resources, we can reduce the size of the data by keeping only the fields of interest with the df.select() function. In the cell below, we will select these fields, retrieve the data and count the number of records.
In [ ]:
df = df.select("site", "station", "ts", "meas_name", "meas_value", "meas_unit")
print("Number of records: {}".format(df.count()))
Data can also be retrieved as a pandas DataFrame using the toPandas() function. Let's perform this operation and take a look at the first 5 records of our sample data.
In [ ]:
dflocal = df.toPandas()
dflocal.head(5)
Once the data is available in a pandas DataFrame, we can call any of the high-performance, easy-to-use data analysis functions available in the pandas library. Let's start by computing basic statistics such as the min, max, mean, standard deviation and percentiles of the temperature measurements.
In [ ]:
dflocal = dflocal.loc[dflocal["meas_name"]=="Temperature"]
dflocal["meas_value"].describe()
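To see what the filter and describe() steps produce without a database connection, here is a small standalone sketch. The records are made up for illustration, and the non-temperature meas_name values are assumptions, not names taken from the EpiData dataset.

```python
import pandas as pd

# Hypothetical long-format records mimicking the tutorial's columns; values are made up.
sample = pd.DataFrame({
    "meas_name": ["Temperature", "Wind_Speed", "Temperature",
                  "Temperature", "Relative_Humidity"],
    "meas_value": [70.0, 5.0, 72.0, 74.0, 60.0],
})

# Keep only the temperature rows, then summarize the numeric column.
temps = sample.loc[sample["meas_name"] == "Temperature", "meas_value"]
stats = temps.describe()

# describe() returns a Series indexed by statistic name.
print(stats[["count", "mean", "std", "min", "50%", "max"]])
```

Because describe() returns a pandas Series, individual statistics can be read by label, for example stats["mean"].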
Next, we'll look at the distribution of the temperature measurements using a histogram.
In [ ]:
plt.rcParams["figure.figsize"] = [10,5]
plt.title("Histogram - Temperature Measurements")
plt.xlabel("Temperature (deg F)")
plt.ylabel("Frequency")
dflocal["meas_value"].hist()
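A histogram is a count of values per bin, and those counts can be inspected numerically as well. The sketch below uses numpy.histogram on made-up readings with explicit, illustrative bin edges; the plot above relies on the default binning of hist() instead.

```python
import numpy as np

# Made-up temperature readings in deg F, including one unusually high value.
values = np.array([60.0, 62.0, 65.0, 68.0, 71.0, 95.0])

# Illustrative bin edges; each bin is half-open except the last, which includes its right edge.
edges = [60, 70, 80, 90, 100]
counts, bin_edges = np.histogram(values, bins=edges)

# Print one line per bin with its range and count.
for left, right, n in zip(bin_edges[:-1], bin_edges[1:], counts):
    print("{:.0f}-{:.0f}: {}".format(left, right, n))
```

An isolated non-empty bin far from the rest, as in this toy data, is the numeric counterpart of the long tail visible in a plotted histogram.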
As we can see, most of the temperature measurements in our sample data are quite moderate. However, there are some measurements that are unusually high. Let's identify these outlier measurements using a simple method that compares each measurement with the sample mean and standard deviation.
In [ ]:
outliers = dflocal.loc[abs(dflocal["meas_value"] - dflocal["meas_value"].mean()) > abs(3*dflocal["meas_value"].std())]
print("Number of Outliers: {}".format(outliers["meas_value"].count()))
outliers.head()
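The three-standard-deviation rule used above can be tried on a toy series to see how it behaves. This is a minimal sketch on made-up readings, not the tutorial's dataset; the names readings and is_outlier are illustrative.

```python
import pandas as pd

# Made-up readings: twenty moderate values plus one spike at 120.
readings = pd.Series([68.0, 70.0, 72.0, 70.0] * 5 + [120.0])

mean = readings.mean()
std = readings.std()  # sample standard deviation (ddof=1), the pandas default

# Flag values that lie more than three standard deviations from the mean.
is_outlier = (readings - mean).abs() > 3 * std
outliers = readings[is_outlier]

print(outliers.tolist())
```

Keep in mind that an extreme value inflates both the mean and the standard deviation it is compared against, so this simple rule becomes less sensitive as outliers grow larger or more numerous.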
Congratulations, you have successfully queried, retrieved and analyzed sample data acquired by a wireless sensor network. The next step is to explore the various capabilities of EpiData by creating your own Jupyter Notebook. Happy Data Exploring!